In this vignette, we’re going to be looking at differential expression analysis. Here, we’ll focus on three different approaches to DE analysis:
For your projects, you’ll want to identify your independent variable, and use the counts matrix I provide here to perform differential expression analysis. This counts matrix is formatted such that the rows represent each gene, and columns represent each cell line (which is different from the other data we have worked with). I also changed the column names to the CCLE format instead of the DepMap ID.
CCLE_metadata <- read_csv("sample_info.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## COSMICID = col_double(),
## Achilles_n_replicates = col_double(),
## cell_line_NNMD = col_double(),
## age = col_double(),
## depmap_public_comments = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 11 parsing failures.
## row col expected actual file
## 1105 depmap_public_comments 1/0/T/F/TRUE/FALSE Adherent version of CCLFPEDS0001T 'sample_info.csv'
## 1166 depmap_public_comments 1/0/T/F/TRUE/FALSE SV40+TERT immortalized kidney cell line 'sample_info.csv'
## 1454 depmap_public_comments 1/0/T/F/TRUE/FALSE Drug resistance: Dabrafenib and Trametinib 'sample_info.csv'
## 1455 depmap_public_comments 1/0/T/F/TRUE/FALSE Drug resistance: SCH772984 'sample_info.csv'
## 1456 depmap_public_comments 1/0/T/F/TRUE/FALSE Drug resistance: Dabrafenib and Trametinib 'sample_info.csv'
## .... ...................... .................. .......................................... .................
## See problems(...) for more details.
CCLE_mutation <- read_csv("CCLE_mutations.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Entrez_Gene_Id = col_double(),
## NCBI_Build = col_double(),
## Chromosome = col_double(),
## Start_position = col_double(),
## End_position = col_double(),
## isDeleterious = col_logical(),
## isTCGAhotspot = col_logical(),
## TCGAhsCnt = col_double(),
## isCOSMIChotspot = col_logical(),
## COSMIChsCnt = col_double(),
## ExAC_AF = col_double(),
## HC_AC = col_time(format = ""),
## RD_AC = col_logical()
## )
## See spec(...) for full column specifications.
# Downloaded CCLE_RNAseq reads from CCLE portal
# and processed in create_CCLE_count_data.R (SJ)
# where RNA read counts were rounded to allow for use in DESeq2
# NOte the use of the readRDS function to read in an R data-set.
CCLE_counts <- readRDS("CCLE_RNAseq_counts.rds")
# BE AWARE! Genes are on rows, samples are on columns
dim(CCLE_counts)
## [1] 56202 1015
head(rownames(CCLE_counts))
## [1] "DDX11L1" "WASH7P" "MIR1302.11" "FAM138A" "OR4G4P"
## [6] "OR4G11P"
head(colnames(CCLE_counts))
## [1] "22RV1_PROSTATE" "2313287_STOMACH"
## [3] "253JBV_URINARY_TRACT" "253J_URINARY_TRACT"
## [5] "42MGBA_CENTRAL_NERVOUS_SYSTEM" "5637_URINARY_TRACT"
For this comparison, we will be filtering by sample, and using the lineage as an independent variable by which we will be making comparisons. I will be using 3 lineages here.
# First filter for only cell lines with RNAseq counts data
# Then filter for specific lineages of interest
CCLE_metadata_lineage <- CCLE_metadata %>%
filter(CCLE_Name %in% colnames(CCLE_counts)) %>%
filter(lineage %in% c("prostate", "kidney", "urinary_tract"))
# Filter the counts to only those cell lines that match the above lineages
CCLE_counts_lineage <- CCLE_counts[,colnames(CCLE_counts) %in% CCLE_metadata_lineage$CCLE_Name]
# Re-order the metadata to match the column order in the counts data
CCLE_metadata_lineage <- CCLE_metadata_lineage[match(CCLE_metadata_lineage$CCLE_Name, colnames(CCLE_counts_lineage)),]
Now we can create a DESeq Dataset object, and run DESeq2.
DDS <- DESeqDataSetFromMatrix(
countData = CCLE_counts_lineage,
colData = CCLE_metadata_lineage,
design = ~lineage
)
## converting counts to integer mode
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
DDS.run <- DESeq(DDS)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 5140 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
Because we did more than 2 comparisons (prostate, kidney, and urinary tract), we need to specify which comparison’s results we want to assess in the results.
We can see which comparisons have been done by running resultsNames
resultsNames(DDS.run)
## [1] "Intercept" "lineage_prostate_vs_kidney"
## [3] "lineage_urinary_tract_vs_kidney"
So we see a lineage_prostate_vs_kidney and a lineage_urinary_tract_vs_kidney. But why don’t we see a lineage_prostate_vs_urinary_tract? That’s because that was the default comparison that DESeq2 ran, and that comparison was used as the “Intercept”.
We can specify which comparison we want to view by using the contrast argument of the results() function. This is a vector of 3 elements: the first element is the name of the column in colData that is used for the design, and the 2nd and 3rd elements are the comparison to show in the results dataframe. Here, I use the contrast argument to look at the prostate vs urinary tract comparison.
res_urinary_vs_prostate <- results(DDS.run, contrast = c("lineage", "urinary_tract", "prostate"))
res_urinary_vs_prostate.df <- as.data.frame(res_urinary_vs_prostate)
head(res_urinary_vs_prostate.df[order(res_urinary_vs_prostate.df$padj, decreasing = F),])
## baseMean log2FoldChange lfcSE stat pvalue
## PRAC1 58.29014 23.444933 3.0034792 7.805925 5.906671e-15
## ABCA4 375.17967 -5.110946 0.8427724 -6.064444 1.324111e-09
## RP11.364B6.1 13.53323 18.941145 3.1611952 5.991767 2.075731e-09
## CTA.392C11.1 114.34276 -8.459799 1.4119371 -5.991626 2.077539e-09
## SLC7A7 1356.09235 -5.125757 0.8403110 -6.099833 1.061792e-09
## CITED4 875.88246 4.326437 0.7286355 5.937725 2.890050e-09
## padj
## PRAC1 1.670820e-10
## ABCA4 1.175347e-05
## RP11.364B6.1 1.175347e-05
## CTA.392C11.1 1.175347e-05
## SLC7A7 1.175347e-05
## CITED4 1.362514e-05
One interesting thing to note: we see that PRAC1, a gene associated with prostate-cancer susceptibility, is differentially expressed in these comparisons - which makes sense!
Another useful visualization tool might be a heatmap. There are several packages designed to create heatmaps in R, but we’ll be using heatmaply, which produces interactive plots using plotly. I’ll also show a more popular heatmap package, called pheatmap.
The heatmap we’re making here will show the top 10 most significant upregulated genes from each comparison. We’ll need to use normalized values, so we’ll be using processing the counts and log-transforming them to visualize everything in a heatmap. For more information on how DESeq2 normalizes the counts data, check out this link here: (HBC Training - DESeq2 normalization)[https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html]
DDS.norm <- estimateSizeFactors(DDS.run) # this creates normalization factors.
counts_norm <- counts(DDS.norm, normalized = T) # extract the normalized counts
counts_norm_log <- log2(counts_norm+1) # add a pseudo-count of 1
# # Subset for only the cell lines in the selected lineages
# CCLE_expression_TPM_lineage <- CCLE_expression_TPM %>%
# # Filter for the DepMap IDS we selected in CCLE_metadata_lineage
# filter(`X1` %in% CCLE_metadata_lineage$DepMap_ID) %>%
# # Convert the DepMap IDs to CCLE Names
# mutate(X1 = CCLE_metadata_lineage$CCLE_Name[match(`X1`, CCLE_metadata_lineage$DepMap_ID)]) %>%
# # Rename the column to accurately reflect what it is
# rename(CCLE_Name = X1)
# Now identify the top 10 upregulated, significant genes from each comparison
top10_kidney_urinarytract <- DDS.run %>%
results(contrast = c("lineage", "kidney", "urinary_tract")) %>%
as.data.frame() %>%
rownames_to_column(var = "Gene") %>%
filter(log2FoldChange > 0) %>%
arrange(padj) %>%
slice(1:10) %>%
pull(Gene)
top10_kidney_prostate <- DDS.run %>%
results(contrast = c("lineage", "kidney", "prostate")) %>%
as.data.frame() %>%
rownames_to_column(var = "Gene") %>%
filter(log2FoldChange > 0) %>%
arrange(padj) %>%
slice(1:10) %>%
pull(Gene)
top10_urinarytract_prostate <- DDS.run %>%
results(contrast = c("lineage", "urinary_tract", "prostate")) %>%
as.data.frame() %>%
rownames_to_column(var = "Gene") %>%
filter(log2FoldChange > 0) %>%
arrange(padj) %>%
slice(1:10) %>%
pull(Gene)
# Combine the top hits into a single vector
top10_genes <- c(top10_kidney_urinarytract, top10_kidney_prostate, top10_urinarytract_prostate)
# Remove any duplicate top hits
top10_genes <- unique(top10_genes)
# Now subset these genes from the log-transformed normalized counts
mat <- counts_norm_log[top10_genes,] # select the rows of each gene by name
# Look at your data quickly in an easy base-R heatmap:
image(mat) # not super useful... what are these axes???
# Let's use the heatmaply package:
check_installation("heatmaply")
## Loading required package: heatmaply
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## ======================
## Welcome to heatmaply version 1.1.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
##
## Attaching package: 'heatmaply'
## The following object is masked from 'package:BiocGenerics':
##
## normalize
## [1] TRUE
heatmaply(mat)
# Add colors to the top clustering dendrogram to indicate lineage
clust_lineages <- data.frame(lineage = CCLE_metadata$lineage[match(colnames(mat), CCLE_metadata$CCLE_Name)])
heatmaply(mat, col_side_colors = clust_lineages, plot_method = "plotly")
Heatmaply is a useful package for heatmap creation because the heatmaps are JavaScript-powered, and interactive! So you can use your mouse to zoom in, subset, and hover over each point to show important information. You can also save the image by clicking on the camera on the top right, which downloads the plot as a static png image.
On the sides and top, you may have noticed the dendrogram, or “tree-like” grouping for the genes and samples. This is a result of clustering the samples, using hierarchical clustering. To learn more about clustering, check out this video.
We’ll be using the hclust function to perform clustering here.
# Calculate distances between each GENE's expression pattern here using the dist function (because genes are on rows)
mat_dist <- dist(mat)
# Cluster based on the distance matrix
gene_clust <- hclust(mat_dist)
plot(gene_clust)
# Calculate distances between each SAMPLE's expression pattern here using the dist function (transpose because samples are on columns)
mat_dist <- dist(t(mat))
# Cluster based on the distance matrix
sample_clust <- hclust(mat_dist)
# Plot, using lineage as the label - extract lineage by matching with metadata
clust_lineages <- CCLE_metadata$lineage[match(colnames(mat), CCLE_metadata$CCLE_Name)]
# Note: cex changes the font size of the labels
plot(sample_clust, labels = clust_lineages, cex = 0.5)